source-mysql: Decode text correctly in non-UTF8 character sets #1979

Merged
11 commits merged into main on Sep 30, 2024

Conversation

@willdonnelly (Member) commented on Sep 24, 2024

Description:

This PR fixes source-mysql to correctly decode text columns with non-UTF8 character sets. Fixes #1951.

Previously, backfilled values were correct because MySQL always sends those to us in UTF-8, but replicated values were occasionally incorrect (or at least sub-optimal): the bytes stored in the binlog are in whatever character set the column uses, and at a certain point we just cast `[]byte -> string`, which results in any non-ASCII bytes getting replaced with U+FFFD REPLACEMENT CHARACTER.

The fix is:

  1. Keep track of the character set of each column. This is the hard part; there are a lot of subtleties to making it work correctly even in oddball cases, such as a DDL ALTER TABLE query occurring after the table backfill which adds a new column whose character set could be explicit, could be implied by an explicit collation, or could be the table default.
  2. Modify the decoding logic to apply an appropriate `[]byte -> string` decode function to text column bytes received via replication, based on the character set of the column. This part is reasonably straightforward, but it required adding an extra parameter to the value translation code to distinguish between backfilled values (which are always UTF-8) and replicated ones (where we need to do all this fanciness).
  3. Implement a handful of decoding functions (a rough sketch follows this list). Currently we've got UTF-8, latin-1, and UCS-2 decoders. There are a couple more which might theoretically come up in the future, but I didn't think they warranted worrying about right now. Honestly, even UCS-2 is unlikely to ever be used in the real world; I added it because, as a double-byte character set, it was a good way of exercising some of the edge cases and clearly demonstrating working vs. non-working behavior.
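
To make item 3 above concrete, here is a minimal sketch of what such decode functions could look like. All identifiers (`textDecoders`, `decodeLatin1`, `decodeUCS2`) are illustrative and are not the connector's actual names:

```go
// Hypothetical sketch of per-charset decode functions. Names are
// illustrative; they are not the identifiers used in source-mysql.
package charsets

import (
	"fmt"
	"unicode/utf16"
)

// decodeLatin1 maps each latin-1 byte to the Unicode code point with the
// same numeric value, which is exactly the latin-1 -> UTF-8 conversion.
func decodeLatin1(bs []byte) (string, error) {
	var runes = make([]rune, len(bs))
	for i, b := range bs {
		runes[i] = rune(b)
	}
	return string(runes), nil
}

// decodeUCS2 interprets the bytes as big-endian 16-bit code units. UCS-2
// has no surrogate pairs, so each unit is a single code point.
func decodeUCS2(bs []byte) (string, error) {
	if len(bs)%2 != 0 {
		return "", fmt.Errorf("invalid UCS-2 length %d", len(bs))
	}
	var units = make([]uint16, 0, len(bs)/2)
	for i := 0; i < len(bs); i += 2 {
		units = append(units, uint16(bs[i])<<8|uint16(bs[i+1]))
	}
	return string(utf16.Decode(units)), nil
}

// decodeUTF8 just casts the bytes, since they are already UTF-8.
func decodeUTF8(bs []byte) (string, error) { return string(bs), nil }

// textDecoders maps a MySQL charset name to its decode function.
var textDecoders = map[string]func([]byte) (string, error){
	"utf8":    decodeUTF8,
	"utf8mb4": decodeUTF8,
	"latin1":  decodeLatin1,
	"ucs2":    decodeUCS2,
}
```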

I have opted to make the connector default to a UTF-8 decoder when faced with an unknown or unrecognized charset rather than failing the capture. This would probably be the wrong choice if the connector were new, but right now we don't know if there are any captures in production using some other character set which I overlooked, so the safest behavior is to keep doing the same thing as before this PR (decoding the bytes by casting to a UTF-8 string directly) while logging an error. Later on I can go check to see if that error actually occurred in production anywhere, add support for the offending character sets, and consider making "unknown charset" a fatal error at that time.
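
As a sketch of that fallback behavior (building on the hypothetical `textDecoders` map above, and using the standard-library `log` package purely for illustration):

```go
// Hypothetical sketch of the "unknown charset" fallback described above.
func decodeText(charset string, bs []byte) (string, error) {
	if charset == "" {
		charset = "utf8mb4" // pre-existing metadata carries no charset info
	}
	if decode, ok := textDecoders[charset]; ok {
		return decode(bs)
	}
	// Unknown charset: log an error but keep the pre-PR behavior of
	// casting the bytes directly to a (nominally UTF-8) string.
	log.Printf("error: unknown column charset %q, decoding as UTF-8", charset)
	return string(bs), nil
}
```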

Workflow steps:

Most captures don't need to care or do anything. Captures will continue operating as normal for preexisting bindings (because their metadata for all tables and columns has no charset information, which means an implicit default of UTF-8). New captures, or new bindings or backfills on existing captures, will reinitialize that metadata and pick up the appropriate charset information. In most cases this will have no impact because the charset really is UTF-8.

For captures of latin-1 or other non-Unicode text columns which actually contain non-ASCII code points (currently captured as U+FFFD REPLACEMENT CHARACTER), re-backfilling the binding will cause the correct values to be captured.

This is a new test case which exercises `text` columns with various
different collations / character sets, storing some test strings with
interesting characters from various languages.

I believe the current set of collations and test strings is enough
to demonstrate the known issues with our current handling of these
edge cases:

  - A `latin1` collation stores strings in the latin-1 character
    set, and then we cast those raw bytes to a string, which causes
    all of the non-ASCII characters to be replaced with U+FFFD.
  - A `ucs2` collation stores strings in the UCS-2 double-byte
    character set, and so for similar reasons all replicated values
    are terribly mangled.
  - A `binary` collation is captured as base64 bytes, because if you
    tell MySQL `TEXT COLLATE binary` it apparently creates a `BLOB`
    column instead. This is not an error, but it seemed worth noting
    and including in the test here. The base64'd text appears to be a
    faithful base64 representation of the input as a UTF-8 string.

Modifies the column-discovery and primary-key-discovery queries
to apply the same "not in a system schema" filtering that the
table-discovery query has, and then modifies column discovery
so that the raw information is logged for each column.

This commit adds most of the necessary plumbing to keep track of
the character set of a text column and apply a charset-aware
decode function instead of just casting the bytes to a string.

However it does not actually implement the proper decoding and
instead just still does `var str = string(bs)` at the appropriate
spot with a TODO noting that's wrong. The next commit will come
in and actually implement proper decoding.
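
A rough sketch of the kind of plumbing this commit describes (the function and parameter names here are made up for illustration, not taken from the connector):

```go
// Hypothetical sketch: the value-translation path now carries enough
// information to tell backfilled values apart from replicated ones.
func translateTextValue(bs []byte, charset string, fromBackfill bool) (string, error) {
	if fromBackfill {
		// Backfilled values are always delivered by MySQL in UTF-8.
		return string(bs), nil
	}
	// Replicated values arrive in the column's own charset. At this
	// point in the PR the decode is still a plain cast, with a TODO
	// to make it charset-aware in the next commit.
	var str = string(bs) // TODO: decode according to `charset`
	return str, nil
}
```
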
Currently this doesn't work and I've stubbed out the logic with
a hard-coded default of `utf8mb4` but now there's a test case
which will show the corrected values when I fix that.

This works just fine, which is to be expected because we haven't
changed anything and we already knew that latin-1 character set
text columns work fine under backfills.

This more or less preserves the old behavior for any charsets we
haven't thought to add to the decoders map, but logs an error so
I can check for it in a few days and add any others we might be
missing.

And use that collation when processing DDL alterations which
don't explicitly specify another collation or charset.
In this case we want to apply the same "charset from collation"
mapping function that we use during discovery. Now the hierarchy
of column charsets goes:

1. Explicit CHARSET declaration
2. Explicit COLLATE declaration
3. Default for the table (which omits utf8mb4 in some cases)
4. Default to utf8mb4 as the last resort
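
A hypothetical sketch of that resolution order (function names are illustrative; `strings` is the standard-library package):

```go
// Hypothetical sketch of the column charset resolution hierarchy.
func resolveColumnCharset(explicitCharset, explicitCollation, tableDefault string) string {
	if explicitCharset != "" {
		return explicitCharset // 1. explicit CHARSET declaration
	}
	if cs := charsetFromCollation(explicitCollation); cs != "" {
		return cs // 2. charset implied by an explicit COLLATE declaration
	}
	if tableDefault != "" {
		return tableDefault // 3. the table's default charset
	}
	return "utf8mb4" // 4. last-resort default
}

// charsetFromCollation derives a charset name from a collation name. MySQL
// collation names are prefixed with their charset (e.g. "latin1_swedish_ci").
func charsetFromCollation(collation string) string {
	if i := strings.Index(collation, "_"); i > 0 {
		return collation[:i]
	}
	return collation
}
```
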
@willdonnelly added the change:planned (This is a planned change) label on Sep 24, 2024
@willdonnelly requested a review from a team on September 24, 2024 at 19:58
@jgraettinger (Member) left a comment

LGTM

It was always the plan to include the table charset in the table
metadata unconditionally; I just left that part for a follow-up
commit to keep the diffs separate.

This removes the `"utf8mb4" -> ""` omission from the replication
code so that it's always made explicit what charset a table is
using in metadata initialized after this change. The default of
`"" -> "utf8mb4"` still exists in the DDL alteration datatype
translation so that will always be explicit for newly added
columns, and likewise the string decoding function defaults
`"" -> "utf8mb4"` so that old metadata works, but after this
change we always specify charset information explicitly when
generating new metadata.
@willdonnelly merged commit cf8ef4d into main on Sep 30, 2024
50 of 53 checks passed
@willdonnelly deleted the wgd/debugging-20240916 branch on September 30, 2024 at 17:45

Successfully merging this pull request may close these issues.

source-mysql: Fix replication of text columns with non-Unicode character sets